The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale-independent features utilising both classical moments-based measures, such as scaled variance, skewness and kurtosis about the major/minor axes, and heuristic measures such as hollows, circularity, rectangularity and compactness.
Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
The objective is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
1- compactness
2- circularity
3- distance_circularity
4- radius_ratio
5- pr.axis_aspect_ratio
6- max.length_aspect_ratio
7- scatter_ratio
8- elongatedness
9- pr.axis_rectangularity
10- max.length_rectangularity
11- scaled_variance
12- scaled_variance.1
13- scaled_radius_of_gyration
14- scaled_radius_of_gyration.1
15- skewness_about
16- skewness_about.1
17- skewness_about.2
18- hollows_ratio
19- class
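Two of the heuristic measures named above can be illustrated on a toy binary silhouette. The formulas below are simplified stand-ins, not the exact HIPS/BINATTS definitions (assumed here: compactness as perimeter squared over area, elongatedness as the bounding-box aspect ratio):

```python
import numpy as np

def silhouette_features(mask):
    """Toy versions of two BINATTS-style measures on a boolean mask.

    These formulas are illustrative assumptions, not the exact HIPS
    definitions: compactness ~ perimeter**2 / area, elongatedness via
    the bounding-box aspect ratio.
    """
    area = mask.sum()
    # crude perimeter estimate: foreground pixels with at least one
    # background 4-neighbour
    padded = np.pad(mask, 1)
    boundary = mask & ~(padded[:-2, 1:-1] & padded[2:, 1:-1]
                        & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = boundary.sum()
    rows, cols = np.nonzero(mask)
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return {
        'compactness': perimeter ** 2 / area,
        'elongatedness': max(height, width) / min(height, width),
    }

# a square vs. an elongated bar: the bar scores higher on both measures
square = np.zeros((20, 20), dtype=bool); square[5:15, 5:15] = True
bar = np.zeros((20, 20), dtype=bool); bar[9:11, 2:18] = True
print(silhouette_features(square))
print(silhouette_features(bar))
```

The real extracted features are scale-independent in a more careful way, but even these crude versions separate compact from elongated shapes.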
#For numerical libraries
import numpy as np
#To handle data in the form of rows and columns
import pandas as pd
#importing seaborn for statistical plots
import seaborn as sns
#importing ploting libraries
import matplotlib.pyplot as plt
#styling figures
plt.rc('font',size=14)
sns.set(style='white')
sns.set(style='whitegrid',color_codes=True)
#To enable plotting graphs in Jupyter notebook
%matplotlib inline
#importing the Encoding library
from sklearn.preprocessing import LabelEncoder
#importing cross_val_score for k-fold cross validation
from sklearn.model_selection import cross_val_score
#importing the zscore for scaling
from scipy.stats import zscore
#Importing PCA for dimensionality reduction and visualization
from sklearn.decomposition import PCA
# Import Support Vector Classifier machine learning library
from sklearn.svm import SVC
#Import Sklearn package's data splitting function which is based on random function
from sklearn.model_selection import train_test_split
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV
# Import the metrics
from sklearn import metrics
#reading the CSV file into pandas dataframe
vehicle_df=pd.read_csv('vehicle.csv')
#Check top 5 records of the dataset
vehicle_df.head()
#Check the last 5 records of the dataset
vehicle_df.tail()
#To show the detailed summary
vehicle_df.info()
#Analyze the distribution of the dataset
vehicle_df.describe().T
Analysing the summary statistics, we can see that:
-compactness, circularity, distance_circularity, elongatedness, pr.axis_rectangularity, max.length_rectangularity, scaled_radius_of_gyration, scaled_radius_of_gyration.1, skewness_about.2 and hollows_ratio are approximately normally distributed.
-radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scatter_ratio, scaled_variance, scaled_variance.1, skewness_about and skewness_about.1 have approximately right-skewed distributions.
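The skew claims above can be checked numerically: pandas' `DataFrame.skew()` returns the sample skewness of each column, and clearly positive values indicate right skew. A small sketch on synthetic stand-in columns (the real check would be `vehicle_df.skew()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-ins: a roughly normal column and a right-skewed one
df = pd.DataFrame({
    'approx_normal': rng.normal(loc=100, scale=10, size=1000),
    'right_skewed': rng.lognormal(mean=4.0, sigma=0.5, size=1000),
})
print(df.skew())  # right_skewed shows clearly positive skewness
```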
#It shows data types of columns
vehicle_df.dtypes
#class attribute is categorical, so convert it from object to category
vehicle_df['class']=vehicle_df['class'].astype('category')
#To get the shape
vehicle_df.shape
#To get the number of columns
vehicle_df.columns
#Checking for missing values in the dataset
vehicle_df.isnull().sum()
#replace blank-string entries with NaN using numpy
vehicle_df = vehicle_df.replace(' ', np.nan)
#Replacing the missing values by median
for i in vehicle_df.columns[:-1]:
    median_value = vehicle_df[i].median()
    vehicle_df[i] = vehicle_df[i].fillna(median_value)
# again check for missing values
vehicle_df.isnull().sum()
# Again check data information
vehicle_df.info()
# Understand the spread and outliers in dataset using boxplot
vehicle_df.boxplot(figsize=(35,15))
The boxplot shows that some columns contain outliers: radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, skewness_about and skewness_about.1.
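One way to quantify what the boxplot shows is to count, per numeric column, the points outside the Tukey fences at 1.5*IQR. A sketch on synthetic data (for the real check, pass `vehicle_df` to the helper):

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(df):
    """Count values outside the Tukey fences [Q1-1.5*IQR, Q3+1.5*IQR]."""
    counts = {}
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        counts[col] = int(((df[col] < low) | (df[col] > high)).sum())
    return pd.Series(counts)

# demo on synthetic data with a few injected extreme values
rng = np.random.default_rng(1)
demo = pd.DataFrame({'clean': rng.normal(size=500),
                     'spiky': np.concatenate([rng.normal(size=495),
                                              [15, -12, 20, 18, -9]])})
print(iqr_outlier_counts(demo))
```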
# Histogram
vehicle_df.hist(figsize=(15,15))
#find the outliers and replace them by median
for col_name in vehicle_df.columns[:-1]:
    q1 = vehicle_df[col_name].quantile(0.25)
    q3 = vehicle_df[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    vehicle_df.loc[(vehicle_df[col_name] < low) | (vehicle_df[col_name] > high), col_name] = vehicle_df[col_name].median()
# again check for outliers in dataset using boxplot
vehicle_df.boxplot(figsize=(35,15))
print('Class: \n', vehicle_df['class'].unique())
vehicle_df['class'].value_counts()
sns.countplot(x='class', data=vehicle_df)
#Encoding of categorical variables
labelencoder_X=LabelEncoder()
vehicle_df['class']=labelencoder_X.fit_transform(vehicle_df['class'])
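LabelEncoder assigns integer codes in sorted (alphabetical) order of the class labels, and the mapping can be recovered from `classes_`. A sketch with synthetic labels mirroring this dataset's class names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(['car', 'bus', 'van', 'car', 'bus'])
enc = LabelEncoder()
codes = enc.fit_transform(labels)
# classes_ holds the sorted labels; their positions are the codes
print(dict(zip(enc.classes_, enc.transform(enc.classes_))))
# inverse_transform maps the codes back to the original labels
print(list(enc.inverse_transform(codes)))
```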
#correlation matrix
cor=vehicle_df.corr()
cor
# correlation plot---heatmap
sns.set(font_scale=1.15)
fig,ax=plt.subplots(figsize=(18,15))
sns.heatmap(cor,vmin=0.8, annot=True,linewidths=0.01,center=0,linecolor="white",cbar=False,square=True)
plt.title('Correlation between attributes',fontsize=18)
ax.tick_params(labelsize=18)
#pair panel
sns.pairplot(vehicle_df,hue='class')
#independent and dependent variables
X=vehicle_df.iloc[:,0:18]
y = vehicle_df.iloc[:,18]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 10)
model = SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, prediction))
print("Classification Report:\n",metrics.classification_report(y_test, prediction))
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM'], 'Accuracy': model.score(X_test, y_test)}, index=['3'])
tempResultsDf
result=[]
print("Accuracy without applying PCA :",end=" ")
print(tempResultsDf["Accuracy"].iloc[0])
result.append(tempResultsDf["Accuracy"].iloc[0])
#Cross-validate an SVC (C=0.5, linear kernel) with 10-fold cross validation
model = SVC(C=0.5, kernel="linear")
scores = cross_val_score(model,X, y, cv=10)
print(scores)
print("Cross validation score without applying PCA :",end=" ")
print(np.mean(scores))
result.append(np.mean(scores))
# Scaling the independent attributes using zscore
X_z=X.apply(zscore)
# prior to scaling
plt.rcParams['figure.figsize']=(10,6)
plt.plot(vehicle_df)
plt.show()
plt.rcParams['figure.figsize']=(10,6)
plt.plot(X_z)
plt.show()
# Calculating the covariance between attributes after scaling
cov_matrix = np.cov(X_z.T)
print('Covariance Matrix \n%s' % cov_matrix)
#Finding eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s' % eigenvectors)
print('\n Eigen Values \n%s' % eigenvalues)
# Make a list of (eigenvalue, eigenvector) pairs, sorted by eigenvalue
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
eigen_pairs[:]
# print out eigenvalues in descending order
print('Eigenvalues in descending order: \n%s' % sorted(eigenvalues, reverse=True))
tot = sum(eigenvalues)
var_exp = [( i /tot ) * 100 for i in sorted(eigenvalues, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
# Ploting
plt.figure(figsize=(8 , 7))
plt.bar(range(1, eigenvalues.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eigenvalues.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
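Rather than reading the component count off the cumulative-variance plot by eye, scikit-learn's PCA also accepts a float `n_components`, in which case it keeps just enough components to explain that fraction of the variance. A sketch on synthetic correlated data shaped like this dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# 200 samples, 18 correlated features driven by 5 latent factors
latent = rng.normal(size=(200, 5))
X_demo = latent @ rng.normal(size=(5, 18)) + 0.1 * rng.normal(size=(200, 18))

pca = PCA(n_components=0.95)   # keep enough components for 95% variance
pca.fit(X_demo)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```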
# Reducing from 18 to 10 dimension space
pca = PCA(n_components=10)
data_reduced = pca.fit_transform(X_z)
data_reduced.transpose()
pca.components_
X_comp = pd.DataFrame(pca.components_,columns=list(X_z))
X_comp.head()
# P_reduce holds the top-10 eigenvectors (as rows), i.e. the reduced basis
# Reducing from 18 to 10 dimension space
P_reduce = np.array([pair[1] for pair in eigen_pairs[:10]])
# projecting original data onto the principal component directions
X_std_10D = np.dot(X_z, P_reduce.T)
# converting array to dataframe for pairplot
Proj_data_df = pd.DataFrame(X_std_10D)
#Let us check it visually
sns.pairplot(Proj_data_df, diag_kind='kde')
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(Proj_data_df,y, test_size = 0.3, random_state = 10)
model = SVC()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
# check the accuracy on the training data
print('Accuracy on Training data: ',model.score(X_train, y_train))
# check the accuracy on the testing data
print('Accuracy on Testing data: ', model.score(X_test , y_test))
#Calculate the recall value
print('Recall value: ',metrics.recall_score(y_test, prediction, average='macro'))
#Calculate the precision value
print('Precision value: ',metrics.precision_score(y_test, prediction, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, prediction))
print("Classification Report:\n",metrics.classification_report(y_test, prediction))
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM'], 'Accuracy': model.score(X_test, y_test)}, index=['4'])
tempResultsDf
print("Accuracy with PCA :",end=" ")
print(tempResultsDf["Accuracy"].iloc[0])
result.append(tempResultsDf["Accuracy"].iloc[0])
#Grid search to tune model parameters for SVC
model = SVC()
params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
model1 = GridSearchCV(model, param_grid=params, verbose=5)
model1.fit(X_train, y_train)
print("Best Hyper Parameters:\n", model1.best_params_)
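After the search, the best model is available directly as `best_estimator_` (GridSearchCV refits it on the full training set by default), so rebuilding it by hand isn't strictly necessary. A self-contained sketch on a toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# toy classification data standing in for the vehicle features
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

search = GridSearchCV(SVC(), {'C': [0.1, 1], 'kernel': ['linear', 'rbf']}, cv=5)
search.fit(X_tr, y_tr)
best = search.best_estimator_      # already refit on all of X_tr
print(search.best_params_, best.score(X_te, y_te))
```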
#Build the model with the best hyper parameters
model = SVC(C=0.5, kernel="linear")
scores = cross_val_score(model, Proj_data_df, y, cv=10)
print(scores)
print(np.mean(scores))
#Store the accuracy results for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Model':['SVM k fold'], 'Accuracy': np.mean(scores)}, index=['5'])
tempResultsDf
print("Cross validation score with PCA :",end=" ")
print(np.mean(scores))
result.append(np.mean(scores))
result
print("Accuracy score without PCA :",result[0])
print("Cross validation score without PCA :",result[1])
print("Accuracy score with PCA :",result[2])
print("Cross validation score with PCA :",result[3])
Accuracy on the held-out test set increased when PCA was applied.
The cross validation score decreased slightly when PCA was applied.
This may be because projecting onto the principal components removes noise and redundancy among the features, leaving the model in a better state to face the test data.
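One caveat worth noting about the workflow above: the z-score scaling and the PCA projection were fitted on the full dataset before cross-validation, which leaks information from the held-out folds into the transform. Wrapping the steps in a scikit-learn Pipeline refits both inside each fold. A leakage-free sketch on a toy dataset (the real pipeline would receive X and y from `vehicle.csv`):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# toy data shaped like the vehicle features
X_toy, y_toy = make_classification(n_samples=400, n_features=18,
                                   n_informative=8, random_state=0)

# scaler and PCA are refit on the training portion of every fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('pca', PCA(n_components=10)),
                 ('svc', SVC(C=0.5, kernel='linear'))])
scores = cross_val_score(pipe, X_toy, y_toy, cv=10)
print(scores.mean())
```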